<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>Dataset workflow &mdash; Graph4NLP v0.4.1 documentation</title><link rel="stylesheet" href="../../_static/css/theme.css" type="text/css" />
    <link rel="stylesheet" href="../../_static/pygments.css" type="text/css" />
  <!--[if lt IE 9]>
    <script src="../../_static/js/html5shiv.min.js"></script>
  <![endif]-->
  <script id="documentation_options" data-url_root="../../" src="../../_static/documentation_options.js"></script>
        <script src="../../_static/jquery.js"></script>
        <script src="../../_static/underscore.js"></script>
        <script src="../../_static/doctools.js"></script>
        <script src="../../_static/language_data.js"></script>
        <script async="async" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
    <script src="../../_static/js/theme.js"></script>
    <link rel="index" title="Index" href="../../genindex.html" />
    <link rel="search" title="Search" href="../../search.html" />
    <link rel="next" title="Customizing your own dataset" href="customize.html" />
    <link rel="prev" title="Chapter 2. Dataset" href="../dataset.html" /> 
</head>

<body class="wy-body-for-nav"> 
  <div class="wy-grid-for-nav">
    <nav data-toggle="wy-nav-shift" class="wy-nav-side">
      <div class="wy-side-scroll">
        <div class="wy-side-nav-search" >
            <a href="../../index.html" class="icon icon-home"> Graph4NLP
          </a>
<div role="search">
  <form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
    <input type="text" name="q" placeholder="Search docs" />
    <input type="hidden" name="check_keywords" value="yes" />
    <input type="hidden" name="area" value="default" />
  </form>
</div>
        </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
              <p class="caption"><span class="caption-text">Get Started</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../welcome/installation.html">Install Graph4NLP</a></li>
</ul>
<p class="caption"><span class="caption-text">User Guide</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../graphdata.html">Chapter 1. Graph Data</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="../dataset.html">Chapter 2. Dataset</a><ul class="current">
<li class="toctree-l2 current"><a class="current reference internal" href="#">Dataset workflow</a></li>
<li class="toctree-l2"><a class="reference internal" href="customize.html">Customizing your own dataset</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../construction.html">Chapter 3. Graph Construction</a></li>
<li class="toctree-l1"><a class="reference internal" href="../gnn.html">Chapter 4. Graph Encoder</a></li>
<li class="toctree-l1"><a class="reference internal" href="../decoding.html">Chapter 5. Decoder</a></li>
<li class="toctree-l1"><a class="reference internal" href="../classification.html">Chapter 6. Classification</a></li>
<li class="toctree-l1"><a class="reference internal" href="../evaluation.html">Chapter 7. Evaluations and Loss components</a></li>
</ul>
<p class="caption"><span class="caption-text">Module API references</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../modules/data.html">graph4nlp.data</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../modules/datasets.html">graph4nlp.datasets</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../modules/graph_construction.html">graph4nlp.graph_construction</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../modules/graph_embedding.html">graph4nlp.graph_embedding</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../modules/prediction.html">graph4nlp.prediction</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../modules/loss.html">graph4nlp.loss</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../modules/evaluation.html">graph4nlp.evaluation</a></li>
</ul>
<p class="caption"><span class="caption-text">Tutorials</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../tutorial/text_classification.html">Text Classification Tutorial</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../tutorial/semantic_parsing.html">Semantic Parsing Tutorial</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../tutorial/math_word_problem.html">Math Word Problem Tutorial</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../tutorial/knowledge_graph_completion.html">Knowledge Graph Completion Tutorial</a></li>
</ul>

        </div>
      </div>
    </nav>

    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
          <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
          <a href="../../index.html">Graph4NLP</a>
      </nav>

      <div class="wy-nav-content">
        <div class="rst-content">
          <div role="navigation" aria-label="Page navigation">
  <ul class="wy-breadcrumbs">
      <li><a href="../../index.html" class="icon icon-home"></a> &raquo;</li>
          <li><a href="../dataset.html">Chapter 2. Dataset</a> &raquo;</li>
      <li>Dataset workflow</li>
      <li class="wy-breadcrumbs-aside">
            <a href="../../_sources/guide/dataset/workflow.rst.txt" rel="nofollow"> View page source</a>
      </li>
  </ul>
  <hr/>
</div>
          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
           <div itemprop="articleBody">
             
  <div class="section" id="dataset-workflow">
<span id="guide-workflow"></span><h1>Dataset workflow<a class="headerlink" href="#dataset-workflow" title="Permalink to this headline">¶</a></h1>
<p>To use a dataset in an experiment, we first need to obtain the original data files, typically by downloading them.
The raw data may then need to be pre-processed to produce processed files.
The processed data is loaded into memory for training. During training, the whole dataset is sliced into small batches
and fed to the model iteratively.
A typical dataset workflow therefore consists of four steps: downloading the raw data, pre-processing it,
loading the processed data and iterating over it.
The following figure illustrates this workflow:</p>
<div class="figure align-center" id="id1">
<a class="reference internal image-reference" href="../../_images/workflow.png"><img alt="../../_images/workflow.png" src="../../_images/workflow.png" style="width: 600px;" /></a>
<p class="caption"><span class="caption-text">Dataset Workflow</span><a class="headerlink" href="#id1" title="Permalink to this image">¶</a></p>
</div>
<p>For the first two steps, we need to specify the raw data and processed data file names. Following a convention similar to
<a class="reference external" href="https://pytorch-geometric.readthedocs.io/en/latest/">PyG</a>, Graph4NLP stores the raw data files in the <code class="docutils literal notranslate"><span class="pre">raw</span></code> sub-directory of
the dataset’s root directory. The processed data files are stored under the <code class="docutils literal notranslate"><span class="pre">processed</span></code> sub-directory, in a further sub-directory given by <code class="docutils literal notranslate"><span class="pre">topology_subdir</span></code>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>


<span class="k">class</span> <span class="nc">Dataset</span><span class="p">:</span>
    <span class="nd">@property</span>
    <span class="k">def</span> <span class="nf">raw_dir</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sd">&quot;&quot;&quot;The directory where the raw data is stored.&quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">root</span><span class="p">,</span> <span class="s1">&#39;raw&#39;</span><span class="p">)</span>

    <span class="nd">@property</span>
    <span class="k">def</span> <span class="nf">processed_dir</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sd">&quot;&quot;&quot;The directory where the processed data is stored.&quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">root</span><span class="p">,</span> <span class="s1">&#39;processed&#39;</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">topology_subdir</span><span class="p">)</span>
</pre></div>
</div>
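<p>As a rough illustration of the resulting layout, the sketch below mirrors the two properties in a small stand-alone class. The class name <code class="docutils literal notranslate"><span class="pre">LayoutSketch</span></code>, the root path and the <code class="docutils literal notranslate"><span class="pre">topology_subdir</span></code> value are hypothetical and chosen only for demonstration; they are not taken from the library.</p>

```python
import os


class LayoutSketch:
    """Hypothetical stand-in that mirrors the two directory properties."""

    def __init__(self, root, topology_subdir):
        self.root = root
        # Assumed here to be a plain sub-directory name.
        self.topology_subdir = topology_subdir

    @property
    def raw_dir(self):
        """Raw files live directly under root/raw."""
        return os.path.join(self.root, "raw")

    @property
    def processed_dir(self):
        """Processed files live under root/processed/topology_subdir."""
        return os.path.join(self.root, "processed", self.topology_subdir)


# Both values below are made up for the example.
layout = LayoutSketch(root="my_dataset", topology_subdir="some_topology")
print(layout.raw_dir)
print(layout.processed_dir)
```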
<p>When a dataset object is initialized, the <code class="docutils literal notranslate"><span class="pre">Dataset</span></code> class first checks whether the raw files exist. If they do not, the <code class="docutils literal notranslate"><span class="pre">_download()</span></code>
routine is triggered to download them.</p>
<p>After checking the raw data, <code class="docutils literal notranslate"><span class="pre">Dataset</span></code> checks whether the processed files exist. If they do not, the <code class="docutils literal notranslate"><span class="pre">_process()</span></code> routine is
triggered to process the raw data and save the processed files locally.</p>
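<p>The two checks described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library’s implementation: the class <code class="docutils literal notranslate"><span class="pre">WorkflowSketch</span></code>, the file names and the bodies of <code class="docutils literal notranslate"><span class="pre">_download()</span></code> and <code class="docutils literal notranslate"><span class="pre">_process()</span></code> are hypothetical placeholders that only write small local files.</p>

```python
import os
import tempfile


class WorkflowSketch:
    """Hypothetical dataset illustrating the check-then-run logic."""

    raw_file_name = "data.txt"        # assumed file name
    processed_file_name = "data.out"  # assumed file name

    def __init__(self, root):
        self.root = root
        self.calls = []  # records which routines actually ran
        # Step 1: download only if the raw file is missing.
        if not os.path.exists(self.raw_path):
            self._download()
        # Step 2: process only if the processed file is missing.
        if not os.path.exists(self.processed_path):
            self._process()

    @property
    def raw_path(self):
        return os.path.join(self.root, "raw", self.raw_file_name)

    @property
    def processed_path(self):
        return os.path.join(self.root, "processed", self.processed_file_name)

    def _download(self):
        # Placeholder: writes a tiny local file instead of fetching anything.
        os.makedirs(os.path.dirname(self.raw_path), exist_ok=True)
        with open(self.raw_path, "w") as f:
            f.write("hello world\n")
        self.calls.append("download")

    def _process(self):
        # Placeholder: upper-cases the raw text and saves the result.
        os.makedirs(os.path.dirname(self.processed_path), exist_ok=True)
        with open(self.raw_path) as src, open(self.processed_path, "w") as dst:
            dst.write(src.read().upper())
        self.calls.append("process")


root = tempfile.mkdtemp()
first = WorkflowSketch(root)   # nothing on disk yet: both routines run
second = WorkflowSketch(root)  # files already exist: neither routine runs
print(first.calls)
print(second.calls)
```

<p>In this sketch the second instance reuses the files written by the first, which is the purpose of the two existence checks: the download and pre-processing costs are paid once per root directory.</p>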
<p>After these two steps, the remaining work reduces to a typical <code class="docutils literal notranslate"><span class="pre">torch.utils.data.Dataset</span></code> and <code class="docutils literal notranslate"><span class="pre">torch.utils.data.DataLoader</span></code> workflow.</p>
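<p>For readers unfamiliar with that workflow, the following sketch shows batched iteration with PyTorch’s <code class="docutils literal notranslate"><span class="pre">Dataset</span></code> and <code class="docutils literal notranslate"><span class="pre">DataLoader</span></code> from <code class="docutils literal notranslate"><span class="pre">torch.utils.data</span></code>. The <code class="docutils literal notranslate"><span class="pre">ToyDataset</span></code> class, its string items and the <code class="docutils literal notranslate"><span class="pre">keep_as_list</span></code> collate function are assumptions made for illustration; a Graph4NLP dataset would hold processed graph data and would likely need its own collate logic.</p>

```python
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical in-memory dataset holding already processed items."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]


def keep_as_list(batch):
    """Collate function that returns the list of samples unchanged."""
    return batch


dataset = ToyDataset(["a", "b", "c", "d", "e"])
# batch_size=2 slices the five items into batches of at most two samples.
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=keep_as_list)

batches = list(loader)
for batch in batches:
    print(batch)
```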
</div>


           </div>
          </div>
          <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
        <a href="../dataset.html" class="btn btn-neutral float-left" title="Chapter 2. Dataset" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
        <a href="customize.html" class="btn btn-neutral float-right" title="Customizing your own dataset" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
    </div>

  <hr/>

  <div role="contentinfo">
    <p>&#169; Copyright 2020, Graph4AI Group.</p>
  </div>

  Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
    <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
    provided by <a href="https://readthedocs.org">Read the Docs</a>.
   

</footer>
        </div>
      </div>
    </section>
  </div>
  <script>
      jQuery(function () {
          SphinxRtdTheme.Navigation.enable(true);
      });
  </script> 

</body>
</html>